In [1]:
import pandas as pd
import numpy as np
import matplotlib 
%matplotlib inline
matplotlib.pyplot.style.use = 'ggplot'

First, load up the data

First you're going to want to create a data frame from the dailybots.csv file which can be found in the data directory. You should be able to do this with the pd.read_csv() function. Take a minute to look at the dataframe because we are going to be using it for this entire worksheet.


In [2]:
data = pd.read_csv( '../../data/dailybots.csv' )
#Look at a summary of the data
data.describe()


Out[2]:
hosts orgs
count 4969.000000 4969.00000
mean 142.308110 15.31133
std 240.955294 21.15150
min 1.000000 1.00000
25% 10.000000 3.00000
50% 38.000000 7.00000
75% 187.000000 16.00000
max 2031.000000 122.00000

In [3]:
data['botfam'].value_counts()


Out[3]:
Sality         540
ConfickerAB    540
Ramnit         540
Zeus           540
zeroaccess     538
Bedep          535
Necurs         511
Zusy           498
PushDo         418
Olmasco        309
Name: botfam, dtype: int64

Exercise 1: Which industry sees the most Ramnit infections? Least?

Count the number of infected days for "Ramnit" in each industry industry. How:

  1. First filter the data to remove all the infections we don't care about
  2. Aggregate the data on the column of interest. HINT: You might want to use the groupby() function
  3. Add up the results

In [4]:
grouped_df = data[data.botfam == "Ramnit"].groupby(['industry'])
grouped_df.sum()


Out[4]:
hosts orgs
industry
Education 7492 3949
Finance 2281 1357
Government/Politics 4484 448
Healthcare/Wellness 544 473
Manufacturing 46303 1155
Retail 17649 558

Exercise 2: Calculate the min, max, median and mean infected orgs for each bot family, sort by median

In this exercise, you are asked to calculate the min, max, median and mean of infected orgs for each bot family sorted by median. HINT:

  1. Using the groupby() function, create a grouped data frame
  2. You can do this one metric at a time OR you can use the .agg() function. You might want to refer to the documentation here: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
  3. Sort the values (HINT HINT) by the median column

In [5]:
group2 = data[['botfam','orgs']].groupby( ['botfam'])
summary = group2.agg([np.min, np.max, np.mean, np.median, np.std])
summary.sort_values( [('orgs', 'median')], ascending=False)


Out[5]:
orgs
amin amax mean median std
botfam
ConfickerAB 27 122 63.035185 56.5 25.619679
Sality 2 43 15.909259 12.0 9.961946
Necurs 1 115 16.739726 11.0 16.644046
Ramnit 1 60 14.703704 9.0 14.485780
Zeus 1 36 9.988889 8.0 6.687004
Bedep 1 42 8.016822 6.0 6.192125
zeroaccess 1 19 5.481413 5.0 2.999787
Zusy 1 20 5.279116 4.0 4.204578
PushDo 1 12 2.882775 2.0 2.535719
Olmasco 1 6 1.592233 1.0 0.722148

Exercise 3: Which date had the total most bot infections and how many infections on that day?

In this exercise you are asked to aggregate and sum the number of infections (hosts) by date. Once you've done that, the next step is to sort in descending order.


In [6]:
df3 = data[['date','hosts']].groupby('date').agg(['count'])
df3.sort_values(by=[('hosts', 'count')], ascending=False).head(10)


Out[6]:
hosts
count
date
2016-07-06 59
2016-06-03 58
2016-08-16 58
2016-06-06 58
2016-06-07 58
2016-06-08 58
2016-06-09 58
2016-06-10 58
2016-06-22 58
2016-06-24 58

Exercise 4: Plot the daily infected hosts for Necurs, Ramnit and PushDo

In this exercise you're going to plot the daily infected hosts for three infection types. In order to do this, you'll need to do the following steps:

  1. Filter the data to remove the botfamilies we don't care about.
  2. Use groupby() to aggregate the data by date and family, then sum up the hosts in each group
  3. Plot the data. Hint: You might want to use the unstack() function to prepare the data for plotting.

In [7]:
filteredData = data[ data['botfam'].isin(['Necurs', 'Ramnit', 'PushDo']) ][['date', 'botfam', 'hosts']]
groupedFilteredData = filteredData.groupby( ['date', 'botfam']).sum()
groupedFilteredData.unstack(level=1).plot(kind='line', subplots=False)


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a0cfc88>

Exercise 5: What are the distribution of infected hosts for each day-of-week across all bot families?

Hint: try a box plot and/or violin plot. In order to do this, there are two steps:

  1. First create a day column where the day of the week is represented as an integer. You'll need to convert the date column to an actual date/time object. See here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html
  2. Next, use the .boxplot() method to plot the data. This has grouping built in, so you don't have to group by first.

In [8]:
data.date = data.date  = pd.to_datetime( data.date )
data['day'] = data.date.dt.weekday
data[['hosts', 'day']].boxplot( by='day')


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a17fd30>

In [9]:
grouped = data.groupby('day')
grouped.boxplot('hosts')


/Users/cgivre/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py:3099: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  layout=layout, **kwds)
Out[9]:
{'boxes': [<matplotlib.lines.Line2D at 0x10a6de630>,
  <matplotlib.lines.Line2D at 0x10a6f1c50>,
  <matplotlib.lines.Line2D at 0x10a7085c0>,
  <matplotlib.lines.Line2D at 0x10a8b0ef0>,
  <matplotlib.lines.Line2D at 0x10a8c7860>,
  <matplotlib.lines.Line2D at 0x10a8df1d0>,
  <matplotlib.lines.Line2D at 0x10a17a748>,
  <matplotlib.lines.Line2D at 0x10a8fcf28>,
  <matplotlib.lines.Line2D at 0x10a914898>,
  <matplotlib.lines.Line2D at 0x10a92b208>,
  <matplotlib.lines.Line2D at 0x10a93bb38>,
  <matplotlib.lines.Line2D at 0x10a9524a8>,
  <matplotlib.lines.Line2D at 0x10a962dd8>,
  <matplotlib.lines.Line2D at 0x10a97b748>,
  <matplotlib.lines.Line2D at 0x10a9900b8>,
  <matplotlib.lines.Line2D at 0x10a9a3908>,
  <matplotlib.lines.Line2D at 0x10a9b70f0>,
  <matplotlib.lines.Line2D at 0x10a9cb898>,
  <matplotlib.lines.Line2D at 0x10a9ddfd0>,
  <matplotlib.lines.Line2D at 0x10a9f4828>,
  <matplotlib.lines.Line2D at 0x10aa05f60>],
 'caps': [<matplotlib.lines.Line2D at 0x10a6e6b70>,
  <matplotlib.lines.Line2D at 0x10a6ebac8>,
  <matplotlib.lines.Line2D at 0x10a6fdbe0>,
  <matplotlib.lines.Line2D at 0x10a6fdcf8>,
  <matplotlib.lines.Line2D at 0x10a70be10>,
  <matplotlib.lines.Line2D at 0x10a8aad68>,
  <matplotlib.lines.Line2D at 0x10a8bde80>,
  <matplotlib.lines.Line2D at 0x10a8bdf98>,
  <matplotlib.lines.Line2D at 0x10a8d37f0>,
  <matplotlib.lines.Line2D at 0x10a8d3ef0>,
  <matplotlib.lines.Line2D at 0x10a8e5a20>,
  <matplotlib.lines.Line2D at 0x10a8e9978>,
  <matplotlib.lines.Line2D at 0x10a8f1e48>,
  <matplotlib.lines.Line2D at 0x10a8f7da0>,
  <matplotlib.lines.Line2D at 0x10a908eb8>,
  <matplotlib.lines.Line2D at 0x10a908fd0>,
  <matplotlib.lines.Line2D at 0x10a91e828>,
  <matplotlib.lines.Line2D at 0x10a91ef60>,
  <matplotlib.lines.Line2D at 0x10a930a58>,
  <matplotlib.lines.Line2D at 0x10a9359b0>,
  <matplotlib.lines.Line2D at 0x10a946ac8>,
  <matplotlib.lines.Line2D at 0x10a946be0>,
  <matplotlib.lines.Line2D at 0x10a957cf8>,
  <matplotlib.lines.Line2D at 0x10a95dc50>,
  <matplotlib.lines.Line2D at 0x10a96dd68>,
  <matplotlib.lines.Line2D at 0x10a96de80>,
  <matplotlib.lines.Line2D at 0x10a97ef98>,
  <matplotlib.lines.Line2D at 0x10a985ef0>,
  <matplotlib.lines.Line2D at 0x10a996fd0>,
  <matplotlib.lines.Line2D at 0x10a99c7f0>,
  <matplotlib.lines.Line2D at 0x10a9ae7b8>,
  <matplotlib.lines.Line2D at 0x10a9aef98>,
  <matplotlib.lines.Line2D at 0x10a9bff60>,
  <matplotlib.lines.Line2D at 0x10a9c6780>,
  <matplotlib.lines.Line2D at 0x10a9d7748>,
  <matplotlib.lines.Line2D at 0x10a9d7f28>,
  <matplotlib.lines.Line2D at 0x10a9e8ef0>,
  <matplotlib.lines.Line2D at 0x10a9ed710>,
  <matplotlib.lines.Line2D at 0x10a9fbfd0>,
  <matplotlib.lines.Line2D at 0x10aa00eb8>,
  <matplotlib.lines.Line2D at 0x10aa11e80>,
  <matplotlib.lines.Line2D at 0x10aa11f98>],
 'fliers': [<matplotlib.lines.Line2D at 0x10a6f1b38>,
  <matplotlib.lines.Line2D at 0x10a702d68>,
  <matplotlib.lines.Line2D at 0x10a8b0dd8>,
  <matplotlib.lines.Line2D at 0x10a8c7748>,
  <matplotlib.lines.Line2D at 0x10a8daf98>,
  <matplotlib.lines.Line2D at 0x10a0ec358>,
  <matplotlib.lines.Line2D at 0x10a8fce10>,
  <matplotlib.lines.Line2D at 0x10a914780>,
  <matplotlib.lines.Line2D at 0x10a925fd0>,
  <matplotlib.lines.Line2D at 0x10a93ba20>,
  <matplotlib.lines.Line2D at 0x10a94bc50>,
  <matplotlib.lines.Line2D at 0x10a962cc0>,
  <matplotlib.lines.Line2D at 0x10a973ef0>,
  <matplotlib.lines.Line2D at 0x10a98af60>,
  <matplotlib.lines.Line2D at 0x10a9a37f0>,
  <matplotlib.lines.Line2D at 0x10a9b3f98>,
  <matplotlib.lines.Line2D at 0x10a9cb780>,
  <matplotlib.lines.Line2D at 0x10a9ddf28>,
  <matplotlib.lines.Line2D at 0x10a9f4710>,
  <matplotlib.lines.Line2D at 0x10aa05eb8>,
  <matplotlib.lines.Line2D at 0x10aa17f98>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x10a6ebbe0>,
  <matplotlib.lines.Line2D at 0x10a702550>,
  <matplotlib.lines.Line2D at 0x10a8aae80>,
  <matplotlib.lines.Line2D at 0x10a8c17f0>,
  <matplotlib.lines.Line2D at 0x10a8da160>,
  <matplotlib.lines.Line2D at 0x10a4a4b70>,
  <matplotlib.lines.Line2D at 0x10a8f7eb8>,
  <matplotlib.lines.Line2D at 0x10a90e828>,
  <matplotlib.lines.Line2D at 0x10a925198>,
  <matplotlib.lines.Line2D at 0x10a935ac8>,
  <matplotlib.lines.Line2D at 0x10a94b438>,
  <matplotlib.lines.Line2D at 0x10a95dd68>,
  <matplotlib.lines.Line2D at 0x10a9736d8>,
  <matplotlib.lines.Line2D at 0x10a98a048>,
  <matplotlib.lines.Line2D at 0x10a99c908>,
  <matplotlib.lines.Line2D at 0x10a9b30f0>,
  <matplotlib.lines.Line2D at 0x10a9c6898>,
  <matplotlib.lines.Line2D at 0x10a9dd080>,
  <matplotlib.lines.Line2D at 0x10a9ed828>,
  <matplotlib.lines.Line2D at 0x10aa00fd0>,
  <matplotlib.lines.Line2D at 0x10aa177b8>],
 'whiskers': [<matplotlib.lines.Line2D at 0x10a6de9b0>,
  <matplotlib.lines.Line2D at 0x10a6e6a58>,
  <matplotlib.lines.Line2D at 0x10a6f7b70>,
  <matplotlib.lines.Line2D at 0x10a6f7c88>,
  <matplotlib.lines.Line2D at 0x10a708da0>,
  <matplotlib.lines.Line2D at 0x10a70bcf8>,
  <matplotlib.lines.Line2D at 0x10a8b6e10>,
  <matplotlib.lines.Line2D at 0x10a8b6f28>,
  <matplotlib.lines.Line2D at 0x10a8ce780>,
  <matplotlib.lines.Line2D at 0x10a8cef98>,
  <matplotlib.lines.Line2D at 0x10a8dffd0>,
  <matplotlib.lines.Line2D at 0x10a8e5908>,
  <matplotlib.lines.Line2D at 0x10a8e9d68>,
  <matplotlib.lines.Line2D at 0x10a8f1d30>,
  <matplotlib.lines.Line2D at 0x10a903e48>,
  <matplotlib.lines.Line2D at 0x10a903f60>,
  <matplotlib.lines.Line2D at 0x10a9197b8>,
  <matplotlib.lines.Line2D at 0x10a919fd0>,
  <matplotlib.lines.Line2D at 0x10a92b978>,
  <matplotlib.lines.Line2D at 0x10a930940>,
  <matplotlib.lines.Line2D at 0x10a941a58>,
  <matplotlib.lines.Line2D at 0x10a941b70>,
  <matplotlib.lines.Line2D at 0x10a952c88>,
  <matplotlib.lines.Line2D at 0x10a957be0>,
  <matplotlib.lines.Line2D at 0x10a968cf8>,
  <matplotlib.lines.Line2D at 0x10a968e10>,
  <matplotlib.lines.Line2D at 0x10a97bf28>,
  <matplotlib.lines.Line2D at 0x10a97ee80>,
  <matplotlib.lines.Line2D at 0x10a990f98>,
  <matplotlib.lines.Line2D at 0x10a9967f0>,
  <matplotlib.lines.Line2D at 0x10a9a97b8>,
  <matplotlib.lines.Line2D at 0x10a9a9f98>,
  <matplotlib.lines.Line2D at 0x10a9b7f60>,
  <matplotlib.lines.Line2D at 0x10a9bf780>,
  <matplotlib.lines.Line2D at 0x10a9d1748>,
  <matplotlib.lines.Line2D at 0x10a9d1f28>,
  <matplotlib.lines.Line2D at 0x10a9e2ef0>,
  <matplotlib.lines.Line2D at 0x10a9e8710>,
  <matplotlib.lines.Line2D at 0x10a9f4fd0>,
  <matplotlib.lines.Line2D at 0x10a9fbeb8>,
  <matplotlib.lines.Line2D at 0x10aa0be80>,
  <matplotlib.lines.Line2D at 0x10aa0bf98>]}

In [ ]: